NSF PAR Search | NSF Public Access Repository


Unraveling Complex Temporal Patterns in EHRs via Robust Irregular Tensor Factorization

Ren, Yifei; Zeng, Linghui; Lou, Jian; Xiong, Li; Ho, Joyce; Jiang, Xiaoqian; Bhavani, Sivasubramanium (June 2025, AMIA Jt Summits Transl Sci Proc)

Free, publicly-accessible full text available June 10, 2026

Cross-institutional dental electronic health record entity extraction via generative artificial intelligence and synthetic notes

https://doi.org/10.1093/jamiaopen/ooaf061

Chuang, Yao-Shun; Lee, Chun-Teh; Lin, Guo-Hao; Brandon, Ryan; Jiang, Xiaoqian; Walji, Muhammad F; Tokede, Oluwabunmi (May 2025, JAMIA Open)

Abstract. Background: While most health-care providers now use electronic health records (EHRs) to document clinical care, many still treat them as digital versions of paper records. As a result, documentation often remains unstructured, with free-text entries in progress notes. This limits the potential for secondary use and analysis, as machine-learning and data analysis algorithms are more effective with structured data. Objective: This study aims to use advanced artificial intelligence (AI) and natural language processing (NLP) techniques to improve diagnostic information extraction from clinical notes in a periodontal use case. By automating this process, the study seeks to reduce missing data in dental records and minimize the need for extensive manual annotation, a long-standing barrier to widespread NLP deployment in dental data extraction. Materials and Methods: This research utilizes large language models (LLMs), specifically Generative Pretrained Transformer 4, to generate synthetic medical notes for fine-tuning a RoBERTa model. This model was trained to better interpret and process dental language, with particular attention to periodontal diagnoses. Model performance was evaluated by manually reviewing 360 clinical notes randomly selected from each participating site’s dataset. Results: The results demonstrated high accuracy of periodontal diagnosis data extraction, with sites 1 and 2 achieving a weighted average score of 0.97-0.98. This performance held for all dimensions of periodontal diagnosis in terms of stage, grade, and extent. Discussion: Synthetic data effectively reduced manual annotation needs while preserving model quality. Generalizability across institutions suggests viability for broader adoption, though future work is needed to improve contextual understanding. Conclusion: The study highlights the potential transformative impact of AI and NLP on health-care research. Most clinical documentation (40%-80%) is free text. Scaling our method could enhance clinical data reuse.
Free, publicly-accessible full text available May 2, 2026

FSLearning: An Efficient Federated Split Learning Framework for Privacy-Preserving Disease Prediction

https://doi.org/10.1007/978-3-031-95838-0_22

Li, Bin; Jiang, Xiaoqian; Hsu, Yu-Chun; Harmanci, Arif O; Gao, Hongchang; Shi, Xinghua (January 2025, Springer Nature Switzerland)

Full Text Available

Exploring the tradeoff between data privacy and utility with a clinical data analysis use case

https://doi.org/10.1186/s12911-024-02545-9

Im, Eunyoung; Kim, Hyeoneui; Lee, Hyungbok; Jiang, Xiaoqian; Kim, Ju Han (December 2024, BMC Medical Informatics and Decision Making)

Abstract. Background: Securing adequate data privacy is critical for the productive utilization of data. De-identification, involving masking or replacing specific values in a dataset, could damage the dataset’s utility. However, finding a reasonable balance between data privacy and utility is not straightforward. Nonetheless, few studies investigated how data de-identification efforts affect data analysis results. This study aimed to demonstrate the effect of different de-identification methods on a dataset’s utility with a clinical analytic use case and assess the feasibility of finding a workable tradeoff between data privacy and utility. Methods: Predictive modeling of emergency department length of stay was used as a data analysis use case. A logistic regression model was developed with 1155 patient cases extracted from a clinical data warehouse of an academic medical center located in Seoul, South Korea. Nineteen de-identified datasets were generated based on various de-identification configurations using ARX, an open-source software for anonymizing sensitive personal data. The variable distributions and prediction results were compared between the de-identified datasets and the original dataset. We examined the association between data privacy and utility to determine whether it is feasible to identify a viable tradeoff between the two. Results: All 19 de-identification scenarios significantly decreased re-identification risk. Nevertheless, the de-identification processes resulted in record suppression and complete masking of variables used as predictors, thereby compromising dataset utility. A significant correlation was observed only between the re-identification reduction rates and the ARX utility scores. Conclusions: As the importance of health data analysis increases, so does the need for effective privacy protection methods. While existing guidelines provide a basis for de-identifying datasets, achieving a balance between high privacy and utility is a complex task that requires understanding the data’s intended use and involving input from data users. This approach could help find a suitable compromise between data privacy and utility.
Full Text Available

CancerGPT for few shot drug pair synergy prediction using large pretrained language models

https://doi.org/10.1038/s41746-024-01024-9

Li, Tianhao; Shetty, Sandesh; Kamath, Advaith; Jaiswal, Ajay; Jiang, Xiaoqian; Ding, Ying; Kim, Yejin (December 2024, npj Digital Medicine)

Abstract. Large language models (LLMs) have been shown to have significant potential in few-shot learning across various fields, even with minimal training data. However, their ability to generalize to unseen tasks in more complex fields, such as biology and medicine, has yet to be fully evaluated. LLMs can offer a promising alternative approach for biological inference, particularly in cases where structured data and sample size are limited, by extracting prior knowledge from text corpora. Here we report our proposed few-shot learning approach, which uses LLMs to predict the synergy of drug pairs in rare tissues that lack structured data and features. Our experiments, which involved seven rare tissues from different cancer types, demonstrate that the LLM-based prediction model achieves significant accuracy with very few or zero samples. Our proposed model, CancerGPT (with ~124M parameters), is comparable to the larger fine-tuned GPT-3 model (with ~175B parameters). Our research contributes to tackling drug pair synergy prediction in rare tissues with limited data, and to advancing the use of LLMs for biological and medical inference tasks.
Full Text Available

Patient-Centered and Practical Privacy to Support AI for Healthcare

https://doi.org/10.1109/TPS-ISA62245.2024.00038

Liu, Ruixuan; Lee, Hong Kyu; Bhavani, Sivasubramanium V; Jiang, Xiaoqian; Ohno-Machado, Lucila; Xiong, Li (October 2024, IEEE)

Full Text Available

Ensuring Trust in Genomics Research

https://doi.org/10.1109/TPS-ISA58951.2023.00011

Ayday, Erman; Vaidya, Jaideep; Jiang, Xiaoqian; Telenti, Amalio (November 2023, IEEE)

Full Text Available

MULTIPAR: Supervised Irregular Tensor Factorization with Multi-task Learning for Computational Phenotyping

Ren, Yifei; Lou, Jian; Xiong, Li; Ho, Joyce; Jiang, Xiaoqian; Bhavani, Sivasubramanium (December 2023, Proceedings of Machine Learning Research 225:498–511, 2023)

Full Text Available

Federated generalized linear mixed models for collaborative genome-wide association studies

https://doi.org/10.1016/j.isci.2023.107227

Li, Wentao; Chen, Han; Jiang, Xiaoqian; Harmanci, Arif (August 2023, iScience)

Full Text Available

Multi-Task Learning for Post-transplant Cause of Death Analysis: A Case Study on Liver Transplant

Ding, Sirui; Tan, Qiaoyu; Chang, Chia-yuan; Zou, Na; Zhang, Kai; Hoot, Nathan R.; Jiang, Xiaoqian; Hu, Xia (January 2024, AMIA Annual Symposium proceedings)

Full Text Available
